PA 3: Identify the Mystery College 🏫

Data Wrangling with dplyr

Author

Names of Your Group!

Today you will use the dplyr package to clean some data. We will then use that cleaned data to figure out what college Margaret has been accepted to.

This task is complex. It requires many different types of abilities. Everyone will be good at some of these abilities but nobody will be good at all of them. In order to produce the best product possible, you will need to use the skills of each member of your group.

Part One: Data Import and Cleaning

This section will clean today’s data so that you can use it more easily in Part Two.

Data Download & Package Loading

First, we declare our package dependencies and load the data.

The data loading function read_csv() will give you an outpouring of helpful information about the dataset. If you do not see the word “error”, there is nothing to be concerned about.

library(tidyverse)

colleges <- read_csv("https://www.dropbox.com/s/bt5hvctdevhbq6j/colleges.csv?dl=1")

Take a look at the variables in your downloaded data by running the following code. This tells reports the data type for each column in the dataset.

glimpse(colleges)

Data Cleaning

Now we will clean the data. Alas, each of the R chunks in this section will cause an error and / or do the desired task incorrectly. Even the chunks that run without error are not correct! You will need to find the mistake and correct it to complete the intended action.

Step 1: There are too many variables in this data set. We don’t need all of them. Narrow your data set down to only:

  • INSTNM name of the institution
  • CITY city, STABBR state, and ZIP ZIP code of the institution
  • ADM_RATE admissions rate
  • SAT_AVG average SAT score
  • UGDS number of undergraduate students
  • TUITIONFEE_IN in- and TUITIONFEE_OUT out-of-state tuition
  • CONTROL Whether the school is public or private
  • REGION region of the school.
colleges_clean1 <- colleges | >  
  select(INSTNM, 
         CITY, 
         STABBR, 
         ZIP,
         ADM_RATE, 
         SAT_AVG, 
         UGDS,
         TUITIONFEE_IN, 
         TUITIONFEE_OUT
         CONTROL,
         REGION) 

Step 2: Remove the schools that are private and for-profit (category 3).

colleges_clean2 <- colleges_clean1 |> 
  filter(CONTROL == 3)

Step 3: Adjust the appropriate variables to be numeric, using as.numeric().

colleges_clean3 <- colleges_clean2 |> 
  mutate(TUITIONFEE_IN  = numeric(TUITIONFEE_IN),
         TUITIONFEE_OUT = numeric(TUITIONFEE_OUT),
         SAT_AVG        = numeric(SAT_AVG),
         UGDS           = numeric(UGDS),
         ADM_RATE       = numeric(ADM_RATE)) 

Step 4: Adjust the appropriate variables to be factors, using as.factor().

colleges_clean4 <- colleges_clean3 |>
  mutate(CONTROL = as.character(CONTROL),
         REGION  = as.character(REGION))

Step 5: Create a new variable called TUITION_DIFF which contains the difference between out-of-state and in-state costs.

colleges_clean5 <- colleges_clean4 |> 
    TUITION_DIFF = TUITIONFEE_OUT - TUITIONFEE_IN

Step 6: Remove every row with missing data.

colleges_clean6 <- colleges_clean5 |> 
  drop.na()

At this point we’ve made six different datasets, but notice each of these steps started with colleges_clean <- colleges_clean |>. That is pretty redundant! Instead, we could have performed all these tasks as one long “pipeline.”

Step 7: Combine your (fixed) code chunks into a single code chunk that carries out all of the steps necessary to clean the data.

Think about coding efficiency – you should not have multiple calls to the same function!

# Code combining ALL of your previous steps into ONE pipeline

colleges_clean <- colleges |>

Part Two: Identify the Mystery College

Wow! Your best friend Ephelia has been accepted to her top choice college! Unfortunately, Ephelia is a very mysterious person, and she won’t tell you directly which college this is. You’ll have to use her clues to figure out which school is her dream school.

Clues:

  1. This college is located in Region 7.

  2. This college’s admission rate is above the median rate for the region.

  3. This college does not charge the same for in- and out-of-state tuition.

  4. The average SAT score of this college is an odd number.

  1. This college is not in Idaho.

  2. Less than 10,000 undergraduates attend this college.

  3. The college is not in the state where Dr. Theobold received a Ph.D.

  4. Of the five options remaining at this step, Ephelia will attend the college that has the lowest admission rate.

Submit the college Ephelia will attend to the Canvas Quiz.